1 Introduction

1.1 Project Intention

Currently, Zillow does not factor enough local intelligence into its housing market predictions, which makes them less accurate than they could be. This project aims to build a linear regression model that better predicts home sale prices in Boston. The model provides estimated elasticities (multipliers) for a set of variables, showing the magnitude of each variable’s influence on home sale prices; these estimates can then be used to predict prices in locations where sales data are not available. Ideally, the model’s predictions should not only achieve the best “average accuracy” but also be relatively consistent in accuracy across locations. In other words, the goal is a model that both fits the sample data well and generalizes to data we have not yet seen.

Striking a balance between the two is essential, because the value of a predictive model lies not in explaining relationships where we already have data but in using existing data to approximate those relationships where we do not. In reality, home prices in each location are shaped by place-specific characteristics that no single model can fully capture. Balancing overall goodness of fit against generalizability lets the model capture as much as possible of the patterns that hold across space, that is, patterns valid both in places we have data for and in places we do not. This is what gives the model its predictive power.

1.2 Challenges of the Project

The difficulty is that as we make the model better at predicting the home prices we have data for, we also risk overfitting, which makes it less effective at predicting the prices we do not have data for. Furthermore, home prices are not only influenced by spatial features independently; they also exhibit positive spatial autocorrelation. That is, prices of nearby homes tend to be similar because they influence one another. For the model to succeed, we must also identify variables that account for this spatial autocorrelation.

1.3 Overall Modeling Strategy

The first step in developing the model is to collect data for a set of candidate variables which, based on theories of housing prices, fall into three groups:
1) Internal housing characteristics
2) Amenities, public services, and socio-economic environment
3) Home prices nearby
To select variables for the regression, we first carry out exploratory analysis, using scatterplots, summary statistics, and GIS overlay mapping to identify the most promising candidates. We then run the linear regression iteratively, adding or removing variables according to the resulting changes in goodness of fit and generalizability, until the model is satisfactory on both criteria.

1.4 Summary of Model Results

Our final model explains 87.7% of the variation in home prices in our dataset (adjusted R-squared of 0.877). The model generalizes across locations, although it performs slightly better in poor and middle-income neighborhoods. The final model includes 27 significant variables, of which the most influential are the neighborhood fixed effects, location in a special zoning district (restricted parking, neighborhood design, or Planned Development Area), road density in the home’s census tract, the presence of air conditioning, and the vacancy rate of the home’s census tract.

2 Data Exploration

2.1 Data Gathering Methods

Apart from the internal housing characteristics attached to the sales record for each house, we gathered variables on amenities, public services, the socio-economic environment, and neighboring home sales from the ACS and Boston Open Data through a series of spatial joins and calculations in both R and ArcGIS. The socio-economic variables for each house are set to the values of the census tract in which it is located. For consistency, we treated the 2011-2015 ACS as the current year and compared it with the 2006-2010 ACS to calculate changes and percent changes. Public services and amenities are measured as distances, or as counts within a specified distance, generated by spatial joins. To improve prediction, we also regrouped some of the housing characteristics, for instance exterior finish, property type, and sale season.

2.2 Data Description

In total, we included 27 variables to predict home prices. Seven of them are categorical, and the rest are numeric. The categorical variables are Property Type, Residential Exemption (Y/N), Building Style, Exterior Finish Structure, Air Conditioning Type, Off-Season Sales (Y/N), and Spatial Zoning District (Y/N).

The remaining variables describe the house itself (land area, floors, rooms, and interior features), its distance to different services, its accessibility, the demographic and economic characteristics of its location, and the sale prices of neighboring buildings. Summary statistics for these continuous variables and for the dependent variable are shown below. The variable names are self-explanatory.

Summary of Continuous Variables
Variable                               Minimum       Maximum  1st Quartile  3rd Quartile          Mean        Median
Sale Price                             200,000    11,600,000       415,000       650,000   642,767.900       519,500
log(Sale Price)                         12.206        16.267        12.936        13.385        13.229        13.161
Parcel’sLandArea                           498        63,941         2,519         5,650     4,553.962         4,222
LatestRemodeledYear                          0         2,013             0         1,998       755.185             0
TotalLivableArea                           573         9,908         1,469         2,922     2,280.976         2,100
Num.OfFloors                                 1             4             2             3         2.181             2
Num.OfBedrooms                               1            14             3             6         4.497             4
Num.OfFullBath                               1             8             1             3         1.995             2
Num.OfHalfBath                               0             3             0             1         0.357             0
Num.OfFireplace                              0             5             0             1         0.405             0
2015CT MedIncome                        24,286       121,096        30,943        93,819    52,434.810        30,943
CT Road_Density                          0.010         5.011         0.035         0.103         0.087         0.066
Dist_Non-publicSchool                  371.547     2,681.292       888.087     1,436.417     1,178.495     1,118.950
2015CT EmploymentDensity               272.487   197,897.800       881.841     3,375.256     4,034.451     2,029.477
2015CT ShareOfBachelor                   0.054         0.951         0.214         0.586         0.401         0.369
2015CT MedHomeValue                          0     1,019,700       315,800       416,500   383,850.000       353,000
2015CT VacancyRate                           0         0.237         0.040         0.086         0.065         0.061
2010-2015CT ShareOfBachelor_Chg         -0.351         1.810         0.027         0.531         0.301         0.223
2010-2015CT MedGrossRent_Chg              -467         2,355            11           202       132.002           105
Num.OfCrimes Within0.5mile                  83        13,165         1,150         5,193     3,417.238         2,600
Num.OfRestaurant Within1mile                 4           778            28            74        69.733            46
Avg.SalesPrice Within0.25mile      269,666.700    11,600,000       451,000       644,786   644,814.700   517,285.700

2.3 Correlation

To build a parsimonious yet effective model, we want each predictor to explain some of the variance in the dependent variable without being linearly explained by the other independent variables. We therefore checked the correlations between the dependent variable and every independent variable, as well as among the independent variables themselves. The correlation matrix is shown below. We select independent variables on two criteria: their correlation with the dependent variable and their correlation with the other independent variables.

Correlation Matrix of Continuous Variables and Dependent Variable

These independent variables are correlated with both sale price and the log of sale price, but only weakly correlated with each other.
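
The two selection criteria can be sketched in code. The following is a minimal illustration in Python, not the procedure actually used in this project: the thresholds (0.2 and 0.8), the function names, and the toy data are all assumptions made for the example.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def screen_predictors(target, candidates, min_target_r=0.2, max_pair_r=0.8):
    """Keep predictors correlated with the target but not with each other.

    Both thresholds are illustrative assumptions, not values from the report.
    """
    # Consider the candidates most strongly correlated with the target first.
    ranked = sorted(candidates,
                    key=lambda name: abs(pearson(candidates[name], target)),
                    reverse=True)
    kept = []
    for name in ranked:
        if abs(pearson(candidates[name], target)) < min_target_r:
            continue  # too weakly related to the dependent variable
        if all(abs(pearson(candidates[name], candidates[k])) <= max_pair_r
               for k in kept):
            kept.append(name)  # not redundant with anything already kept
    return kept

# Hypothetical toy data for six homes (not taken from the Boston dataset).
log_price = [12.2, 12.6, 12.9, 13.1, 13.4, 13.9]
candidates = {
    "living_area_sqft": [900, 1400, 1800, 2100, 2600, 3500],
    "living_area_sqm": [84, 130, 167, 195, 242, 325],  # near-duplicate of sqft
    "unrelated": [3, 1, 4, 1, 5, 2],
}
kept = screen_predictors(log_price, candidates)
print(kept)
```

In this toy case only one of the two living-area measures survives, because the second adds no information beyond the first, and the unrelated variable is dropped for its weak correlation with log price.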

2.4 Dependent Variable

The map below shows how sale prices are distributed spatially, which is one basis for choosing predictors. It also supports the presence of spatial autocorrelation, since high prices tend to cluster together, as do low prices.

According to the histograms below, sale prices are strongly right-skewed rather than normally distributed, which sits poorly with the normality assumption on the errors that OLS inference relies on. We therefore use the log-transformed sale price, which is approximately normally distributed, as the dependent variable.
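
The effect of the log transformation on skewness can be illustrated with a short sketch. The prices below are hypothetical log-normal draws, not the Boston sales data; the parameters 13.2 and 0.5 are assumptions chosen only to resemble the scale of log sale prices in this report.

```python
import math
import random

def skewness(values):
    """Sample skewness: third central moment over variance to the power 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical prices drawn from a log-normal distribution (assumption).
rng = random.Random(42)
prices = [rng.lognormvariate(13.2, 0.5) for _ in range(2000)]
log_prices = [math.log(p) for p in prices]

skew_raw = skewness(prices)
skew_log = skewness(log_prices)
print(f"skewness of price:      {skew_raw:.2f}")
print(f"skewness of log(price): {skew_log:.2f}")
```

A skewness near zero after the transformation is what the histogram of log sale prices would be expected to show if prices are roughly log-normal.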

2.5 Independent Variables

The justifications for some of the independent variables are shown below.

Compared with the previous sale price map, houses close to more restaurants are more expensive. People are willing to pay more for better amenities.

According to the map above, houses close to non-public schools have higher values. People are willing to pay more for better school quality.

Sales Price by Neighborhood

According to the map above, houses located in the western and northern neighborhoods are generally more expensive. Similar housing prices tend to cluster together.

The scatterplot above shows that housing prices are correlated with road density. Houses with better connectivity or accessibility are generally more expensive.

According to the boxplot above, housing prices differ by exterior finish.

3 Methods

As mentioned above, the final model was selected by iteratively feeding different variables into the regression and observing the changes in overall goodness of fit and generalizability.

Overall goodness of fit is judged by the adjusted R-squared of the final model, that is, the share of the variation in home prices captured by the model after adjusting for the number of predictors, calculated on the home sale prices in the sample data.
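
As a check, the adjusted R-squared can be reproduced from the figures reported in Section 4.1 (R-squared of 0.881, 1,313 observations, and 46 estimated slope coefficients) using the standard formula. The function name below is ours, chosen for illustration.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Standard adjustment of R-squared for the number of predictors."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Values taken from the model summary table in Section 4.1.
adj = adjusted_r2(r2=0.881, n_obs=1313, n_predictors=46)
print(round(adj, 3))
```

The result rounds to 0.877, which agrees with the adjusted R-squared in the model summary.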

To gauge the model’s generalizability, 75% of the sample is randomly selected as a training dataset, which is used to estimate the coefficient of each variable in the model, while the remaining 25% serves as a test dataset for validating the model. The estimated coefficients are applied to the test dataset to calculate predicted home sale prices. Comparing the actual prices in the test dataset with the predicted values tells us whether the model is generalizable. The goal is to minimize the absolute percent difference between predicted and observed values in both the training and test datasets.

A Moran’s I test, which determines whether values located near one another are significantly correlated, is conducted on the prediction errors to reveal whether the model systematically overpredicts or underpredicts home prices at nearby locations. A good model should produce a random spatial pattern in its errors, with a Moran’s I p-value larger than 0.05.
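
The 75/25 holdout validation can be sketched as follows. This is a minimal illustration in Python under stated assumptions, not the code used for this project: it uses a single hypothetical predictor (living area), synthetic data, and our own function names. Predictions are made on the log scale and exponentiated before MAPE is computed on prices.

```python
import math
import random

def fit_simple_ols(x, y):
    """Least-squares intercept and slope for one predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    return mean_y - slope * mean_x, slope

def mape(observed, predicted):
    """Mean absolute percent error, as a fraction (0.12 means 12%)."""
    return sum(abs(p - o) / o for o, p in zip(observed, predicted)) / len(observed)

def holdout_mape(x, log_y, train_share=0.75, seed=0):
    """Fit on a random training share, then score training and test rows."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)
    cut = int(round(train_share * len(idx)))
    train, test = idx[:cut], idx[cut:]
    b0, b1 = fit_simple_ols([x[i] for i in train], [log_y[i] for i in train])

    def score(rows):
        observed = [math.exp(log_y[i]) for i in rows]
        predicted = [math.exp(b0 + b1 * x[i]) for i in rows]
        return mape(observed, predicted)

    return score(train), score(test)

# Synthetic data (assumption): log price rises linearly with living area.
rng = random.Random(1)
area = [rng.uniform(600, 4000) for _ in range(400)]
log_price = [12.5 + 0.0003 * a + rng.gauss(0, 0.15) for a in area]

train_mape, test_mape = holdout_mape(area, log_price)
print(f"training MAPE: {train_mape:.3f}, test MAPE: {test_mape:.3f}")
```

A training MAPE and a test MAPE of similar size are the pattern that indicates a generalizable model.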

Still, the generalizability indicated by a single random holdout validation could simply be a matter of luck: the random test set may just happen to be very similar to the training set. To guard against this, we use ten-fold cross-validation, which splits the sample data into ten random sub-samples and conducts ten holdout validations, each time holding out one sub-sample and training on the other nine. A generalizable model would have similar R-squared values across the ten validations.
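
Ten-fold cross-validation can be sketched in the same spirit. Again, this is an illustration under assumptions rather than the project’s code: one hypothetical predictor, synthetic data, and function names of our own. R-squared is computed on the log scale for each held-out fold.

```python
import random
import statistics

def fit_line(x, y):
    """Least-squares intercept and slope for one predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    return mean_y - slope * mean_x, slope

def r_squared(observed, predicted):
    """1 minus residual sum of squares over total sum of squares."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot

def k_fold_r2(x, y, k=10, seed=0):
    """Hold out each of k random folds in turn; return the k R-squared values."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint, near-equal folds
    scores = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        b0, b1 = fit_line([x[i] for i in train], [y[i] for i in train])
        scores.append(r_squared([y[i] for i in fold],
                                [b0 + b1 * x[i] for i in fold]))
    return scores

# Synthetic data (assumption): log price rises linearly with living area.
rng = random.Random(1)
area = [rng.uniform(600, 4000) for _ in range(400)]
log_price = [12.5 + 0.0003 * a + rng.gauss(0, 0.15) for a in area]

scores = k_fold_r2(area, log_price)
print(f"mean R-squared: {statistics.mean(scores):.3f}")
print(f"std. deviation: {statistics.stdev(scores):.3f}")
```

A small standard deviation relative to the mean indicates that performance does not depend much on which observations happen to be held out.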

To examine whether the model performs well across different neighborhoods, the MAPE (mean absolute percent error of predictions) is calculated for each neighborhood in Boston and visualized on a map. The desired outcome is a MAPE that is relatively similar across neighborhoods. Furthermore, spatial cross-validation is conducted three times, each time holding out a subsample of home prices from rich, poor, or middle-income neighborhoods. Again, a generalizable model would have a similar MAPE in the three validations.
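
Grouping MAPE by neighborhood amounts to averaging absolute percent errors within each group. A minimal sketch, with a hypothetical function name and made-up prices (only the neighborhood names come from this report):

```python
from collections import defaultdict

def mape_by_group(groups, observed, predicted):
    """Mean absolute percent error (as a fraction) within each group."""
    abs_pct_errors = defaultdict(list)
    for group, obs, pred in zip(groups, observed, predicted):
        abs_pct_errors[group].append(abs(pred - obs) / obs)
    return {g: sum(errs) / len(errs) for g, errs in abs_pct_errors.items()}

# Hypothetical observed and predicted sale prices for four homes.
neighborhood = ["Roxbury", "Roxbury", "Back Bay", "Back Bay"]
observed = [400_000, 500_000, 1_000_000, 2_000_000]
predicted = [440_000, 450_000, 900_000, 2_400_000]

result = mape_by_group(neighborhood, observed, predicted)
for name, value in result.items():
    print(f"{name}: {value:.2%}")
```

The same function applies to the spatial cross-validation if the grouping key is the income class of the held-out neighborhood instead of its name.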

4 Model Results

4.1 Model Summary

Based on the training set, our final model is summarized below.

Summary of In-Sample Model Results
Dependent variable: Log(Sales Price)
Property Type 104 0.037** (0.016)
Property Type 105 0.069*** (0.025)
With Residential Exemption 0.024** (0.010)
Parcel’s Land Area 0.00001*** (0.00000)
Latest Remodeled Year 0.00002*** (0.00000)
Total Livable Area 0.0001*** (0.00001)
Num Of Floors 0.057*** (0.012)
Num Of Bedrooms 0.013*** (0.004)
Exterior Finish Structure M -0.073*** (0.014)
Exterior Finish Structure W -0.072*** (0.016)
Num Of Full Bath 0.034*** (0.010)
Num Of Half Bath 0.026*** (0.010)
Type of Air Conditioning N -0.068*** (0.015)
Num Of Fireplace 0.042*** (0.008)
2015CT Median Income 0.00000** (0.00000)
CT Road Density -0.248* (0.127)
Distance to Non-publicSchool -0.0001*** (0.00002)
2015CT Employment Density -0.00000** (0.00000)
2015CT Share Of Bachelor Degree 0.492*** (0.058)
2015CT Median Home Value 0.00000** (0.00000)
2015CT Vacancy Rate -0.475** (0.211)
2010-2015CT Share Of Bachelor Degree Change -0.040*** (0.015)
2010-2015CT Median Gross Rent Change -0.00003* (0.00002)
Num Of Crimes Within 0.5 mile -0.00003*** (0.00000)
Num Of Restaurant Within 1 mile 0.001*** (0.0002)
Off Season Sales -0.048*** (0.012)
Spatial Zoning District 0.050*** (0.015)
Neighborhood Back Bay -0.882*** (0.292)
Neighborhood Bay Village 1.384** (0.634)
Neighborhood Beacon Hill -0.123 (0.164)
Neighborhood Brighton -0.223** (0.098)
Neighborhood Charlestown -0.121 (0.090)
Neighborhood Dorchester -0.312*** (0.078)
Neighborhood Downtown 0.041 (0.242)
Neighborhood East Boston -0.411*** (0.095)
Neighborhood Fenway 0.220 (0.149)
Neighborhood Hyde Park -0.451*** (0.080)
Neighborhood Jamaica Plain -0.195** (0.077)
Neighborhood Mattapan -0.364*** (0.079)
Neighborhood Mission Hill -0.063 (0.098)
Neighborhood Roslindale -0.345*** (0.078)
Neighborhood Roxbury -0.366*** (0.082)
Neighborhood South Boston -0.216** (0.090)
Neighborhood South End 0.146 (0.129)
Neighborhood West Roxbury -0.326*** (0.079)
Average Sales Price Within 0.25 mile 0.00000*** (0.00000)
Constant 12.756*** (0.102)
Observations 1,313
R2 0.881
Adjusted R2 0.877
Residual Std. Error 0.162 (df = 1266)
F Statistic 203.590*** (df = 46; 1266)
Note: *p<0.1; **p<0.05; ***p<0.01

4.2 Validation with a Random 75% Training Set and 25% Test Set

The R-squared, root mean square error, mean absolute error, and MAPE for the training set are:

Training Set Model Summary
Rsquared  RootMeanSquareError  MeanAbsError  MeanAbsPercentError
0.881     382,326.000          91,715.490    0.121

The R-squared, root mean square error, mean absolute error, and MAPE for the test set are:

Test Set Summary
Rsquared  RootMeanSquareError  MeanAbsError  MeanAbsPercentError
0.926     115,458.800          75,259.870    0.126

Comparing the errors in the training set and the test set suggests that our model is generalizable: the MAPE is nearly identical in the two sets (12.1% versus 12.6%), and the test-set R-squared (0.926) is no lower than the training-set value (0.881).

4.3 Cross-Validation

However, since we held out only one random test set, the result may still reflect chance. We therefore use cross-validation to repeat the holdout 10 times and examine the mean and standard deviation of R-squared and MAE across folds.

Cross-Validation Summary
                   RootMeanSquareError  Rsquared  MeanAbsError
Mean               0.217                0.805     0.135
StandardDeviation  0.097                0.145     0.017

According to the two histograms above, we can conclude that the model is generalizable overall. R-squared values cluster at the high end whereas MAE values cluster at the low end, although some variance exists.

4.4 Regression Diagnostics Based on the Randomly Selected Training and Test Sets

We also plot the residuals of the 25% randomly selected test set against the observed and the predicted values, respectively. The residuals stay almost constant as the predicted sale price increases, suggesting a good fit. However, the residuals increase with the observed sale price, suggesting that the model omits some variables needed to explain this variance in the test set. Still, the residuals are small relative to the magnitude of the sale prices.

4.5 Spatial Autocorrelation

We also conducted a Moran’s I test on the test set to check for spatial autocorrelation in our model. According to the test, there is no significant positive spatial autocorrelation in the test set (p = 0.990); the negative standard deviate indicates that, if anything, nearby values are slightly dissimilar rather than clustered.

25% Randomly Selected Test Set Moran’s I Test Summary
                     standard deviate  p.value
Moran’s I statistic  -2.341            0.990
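
For readers unfamiliar with the statistic, Moran’s I can be sketched in a few lines. The example below is an illustration only: it uses a toy 5 by 5 grid, binary distance-band weights, and a one-sided permutation test in place of the analytical test summarized above. All names, the grid, and the radius are assumptions.

```python
import math
import random

def distance_band_weights(coords, radius):
    """Binary weights: 1 if two distinct points lie within the radius."""
    n = len(coords)
    return [[1.0 if i != j and math.dist(coords[i], coords[j]) <= radius else 0.0
             for j in range(n)] for i in range(n)]

def morans_i(values, weights):
    """Moran's I: (n / S0) * sum_ij w_ij d_i d_j / sum_i d_i^2."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)
    cross = sum(weights[i][j] * dev[i] * dev[j]
                for i in range(n) for j in range(n))
    return (n / s0) * cross / sum(d * d for d in dev)

def permutation_test(values, weights, n_perm=199, seed=0):
    """One-sided p-value for positive autocorrelation via random relabeling."""
    rng = random.Random(seed)
    observed = morans_i(values, weights)
    shuffled = list(values)
    at_least_as_large = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if morans_i(shuffled, weights) >= observed:
            at_least_as_large += 1
    return observed, (at_least_as_large + 1) / (n_perm + 1)

# Toy 5 x 5 grid; points one unit apart count as neighbors (assumption).
coords = [(i, j) for i in range(5) for j in range(5)]
weights = distance_band_weights(coords, radius=1.0)

clustered = [float(x) for x, _ in coords]          # smooth west-east gradient
alternating = [float((x + y) % 2) for x, y in coords]  # checkerboard

i_clustered, p_clustered = permutation_test(clustered, weights)
i_alternating, p_alternating = permutation_test(alternating, weights)
print(f"clustered:   I = {i_clustered:.2f}, p = {p_clustered:.3f}")
print(f"alternating: I = {i_alternating:.2f}, p = {p_alternating:.3f}")
```

The clustered pattern yields a positive I with a small p-value, while the alternating pattern yields a negative I with a one-sided p-value near 1, the same combination of a negative statistic and a large p-value seen in the summary above.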

For a better illustration, we map the residuals for the training set and the mean absolute percent error grouped by neighborhood.

According to these maps, the residuals are generally randomly distributed.

4.6 Spatial Cross-Validation

For a closer check on how the model behaves across space, we hold out a rich, a middle-income, and a poor neighborhood in turn, fit the regression on the remaining data, and test it on the held-out set.

Spatial Cross-Validation Model Summary
               RootMeanSquareError  MAPE
holdOutRich    486,488.400          0.138
holdOutPoor    652,038.400          0.089
holdOutMiddle  481,910.300          0.108

According to the three histograms, the MAPE for all three tests is generally low (between 8.9% and 13.8%), but the model’s predictive power varies somewhat across neighborhoods.

5 Discussion

Overall, we think this is an effective model. According to the standardized coefficients bar chart, six of the ten most influential variables in the model are neighborhood fixed effects; the other four are location in a special zoning district (restricted parking, neighborhood design, or Planned Development Area), road density in the home’s census tract, the absence of air conditioning, and the vacancy rate in the home’s census tract. Because the dependent variable is the log of sale price, a coefficient b corresponds to a percent change in price of 100 * (exp(b) - 1). For the top five, the coefficients can be interpreted as follows:
1) Neighborhood West Roxbury: All else equal, home prices are 27.82% lower on average when the home is located in West Roxbury, relative to the omitted reference neighborhood.
2) Special zoning district (restricted parking, neighborhood design, or Planned Development Area): All else equal, home prices are 5.13% higher on average when the home is located in any one of these three special zones.
3) Road density in the home’s census tract: All else equal, home prices fall 21.96% on average as the number of roads per square mile in the census tract increases by 1.
4) Neighborhood South Boston: All else equal, home prices are 19.43% lower on average when the home is located in South Boston, relative to the omitted reference neighborhood.
5) No air conditioning: All else equal, home prices are 6.57% lower on average when the home has no AC.
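
For a model with log sale price as the dependent variable, the conventional conversion of a coefficient b into a percent change in price is 100 * (exp(b) - 1). The sketch below applies it to the five coefficients from the model summary in Section 4.1; the function name is ours.

```python
import math

def percent_effect(beta):
    """Percent change in price implied by a coefficient in a log-price model."""
    return 100 * (math.exp(beta) - 1)

# Coefficients copied from the model summary table in Section 4.1.
coefficients = {
    "Neighborhood West Roxbury": -0.326,
    "Spatial Zoning District": 0.050,
    "CT Road Density": -0.248,
    "Neighborhood South Boston": -0.216,
    "Type of Air Conditioning N": -0.068,
}
effects = {name: percent_effect(b) for name, b in coefficients.items()}
for name, pct in effects.items():
    print(f"{name}: {pct:+.2f}%")
```

Note that the conversion is not symmetric: a coefficient of -0.326 implies a decline of about 27.8%, not the 38.5% obtained by applying the formula to the coefficient’s absolute value.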

The map of model residuals for the test dataset and the map of MAPE (mean absolute percentage error) by neighborhood both indicate that our model accounts for the spatial variation in prices to a certain extent: some spatial patterns are discernible, but they are not very pronounced. The spatial cross-validation results show that the model performs slightly better at predicting home prices in middle-income and poor neighborhoods than in rich neighborhoods.

There are some other interesting variables we did not incorporate into the model, for example accessibility to good-quality schools and to highly rated restaurants. As mentioned above, the model is better at predicting home prices in middle-income and poor neighborhoods than in rich neighborhoods. Although our final model includes distance to non-public schools and number of restaurants within 1 mile as two significant variables, these are not specific enough to capture the influence of high-quality amenities on home prices. After all, in many real-world cases it is not general accessibility to amenities that makes home prices surge, but accessibility to really good amenities, such as reputable elementary schools and five-star restaurants. In addition, amenities and other socio-economic variables, such as distance to the nearest college and distance to the nearest transit station, probably have significant non-linear relationships with home prices, as the scatterplots between them and home prices also suggest. In other words, home prices rise sharply once the distance to the nearest college or transit station falls below a certain threshold. In a linear regression model, however, these variables were dropped because they are not significantly correlated with home prices in a linear way. This means the model misses the part most useful for explaining the very highest home prices, which is another important reason why it performs relatively poorly in rich neighborhoods.

6 Conclusion

Although the model is not perfect, we still recommend it to Zillow. First, the model has reasonably good fit and generalizability. It also accounts for the influence of nearby home prices. In general, the model has good predictive power.

To further improve the model, several options could be tried:
1) Obtain more data on accessibility to high-quality amenities, which can drive home prices up very quickly, to improve predictive accuracy for the highest-priced homes.
2) Consider non-linear transformations of the independent variables, such as squared and log terms, to better capture their relationships with home prices.
3) Reduce the geographic scope of the spatial lag variable (home sale prices within a certain buffer). In our final model the scope is a quarter-mile buffer; this buffer could be smaller, since home prices can differ greatly from block to block.
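
The third option can be explored by making the buffer radius a parameter of the spatial lag calculation. The sketch below is a hypothetical illustration, not the calculation used for our variable: it assumes coordinates already projected to miles, uses made-up prices, and excludes each home’s own price from its neighborhood average, which is a design choice of the sketch.

```python
import math

def mean_price_within(coords, prices, radius):
    """Average price of the other sales within `radius` of each sale.

    Returns None for a sale with no neighbors inside the buffer.
    """
    lags = []
    for i, here in enumerate(coords):
        nearby = [prices[j] for j, there in enumerate(coords)
                  if j != i and math.dist(here, there) <= radius]
        lags.append(sum(nearby) / len(nearby) if nearby else None)
    return lags

# Hypothetical sales: planar coordinates in miles and sale prices.
coords = [(0.0, 0.0), (0.1, 0.0), (0.3, 0.0), (5.0, 5.0)]
prices = [400_000, 500_000, 600_000, 900_000]

quarter_mile = mean_price_within(coords, prices, radius=0.25)
eighth_mile = mean_price_within(coords, prices, radius=0.125)
print(quarter_mile)
print(eighth_mile)
```

Shrinking the buffer makes the lag more local but leaves more homes without any neighbors, so a smaller radius needs a fallback rule for isolated sales.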